Conversation
Here are the results for a run on my dev machine (4090).
It's the first time I'm looking at this code, and my second question was: what is the purpose of the benchmarks? Cursor (GPT-5.4 Extra High Fast) offered some answers. I asked it to generate a "Motivation" section based on what it found, see below. I think it'd be a great addition to the README.

Motivation

These benchmarks are intended to measure the latency overhead of calling CUDA Driver APIs through the Python bindings. The main goal is to help answer questions such as:

The paired C++ benchmarks are included to provide a lower-level reference point for the same operations. Comparing Python and C++ results helps estimate the additional cost introduced by the Python-to-C boundary and by binding-specific marshalling work. These benchmarks are not intended to measure overall GPU performance, kernel throughput, or end-to-end application speed. Most of the benchmarked operations are deliberately tiny, so the reported numbers are best interpreted as binding/API-call latency measurements and regression signals, rather than as predictions of full application performance. Because the benchmarked operations are so small, methodology matters a lot: the most useful comparisons are between Python and C++ benchmarks that perform work as close to identical as possible and are run under similar conditions.
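Since the quantities being measured are sub-microsecond per-call latencies, the shape of the harness (warm-up, batching, aggregation) dominates the noise floor. A minimal stdlib-only sketch of that measurement pattern — not the actual benchmark code from this PR, and using a trivial no-op as a stand-in for a driver-API binding call — could look like:

```python
import statistics
import time

def bench_call_latency_ns(fn, *, warmup=1_000, inner=10_000, samples=20):
    """Estimate per-call latency of ``fn`` in nanoseconds.

    Warm-up iterations let one-time effects settle; each sample times a
    batch of ``inner`` calls so timer granularity is amortized across
    many calls; the median across samples damps scheduler noise.
    """
    for _ in range(warmup):  # warm-up phase, results discarded
        fn()
    per_call_ns = []
    for _ in range(samples):
        t0 = time.perf_counter_ns()
        for _ in range(inner):
            fn()
        t1 = time.perf_counter_ns()
        per_call_ns.append((t1 - t0) / inner)
    return statistics.median(per_call_ns)

# Stand-in workload: a no-op Python call, playing the role of a tiny
# CUDA Driver API call made through the bindings.
noop_ns = bench_call_latency_ns(lambda: None)
print(f"no-op call: {noop_ns:.1f} ns/call")
```

The same skeleton applies on the C++ side; keeping the warm-up and batching policy aligned between the two harnesses is exactly the "similar conditions" requirement the Motivation section calls out.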
My first question (to Cursor) when reviewing this PR was:
After it gave me the response below, I started thinking about the motivation, with the result in the previous comment. In light of that, the findings below still seem relevant, but I'd need to look closer to be certain which of the "not clean apples-to-apples" aspects it found are actually meaningful. I hope they are at least a good starting point for figuring it out together, so I'm copy-pasting them below.

Findings
What Looks Reasonably Matched
Bottom Line
Note
Yeah, the motivation is correct: it's just the latency/overhead of the Python layer, not throughput. I'll add it to the README. And yeah, on the review, I think I'd agree with most of the items marked as high (and I will try to match those closer), but for the other ones I think it's almost impossible to do a full apples-to-apples comparison, so I'm not sure I'd change much there. I'll leave it up to you all to make that call :D
Ok, I added a couple of

About the second comment, I think it's "ok". The C++ one doesn't match pyperf fully, since pyperf does a bit fancier stats for the warm-up and the number of measurements while the C++ side uses a fixed count, but I don't think it should affect much, especially for measuring host latency?
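To illustrate why the mismatch between a fixed-count C++ loop and pyperf's adaptive approach tends not to matter for stable host-latency workloads: pyperf calibrates how many inner iterations each sample runs so that a sample lasts long enough to swamp timer granularity, while a fixed-count harness just picks the batch size up front. A simplified stdlib-only sketch of the two strategies (this is a rough caricature of pyperf's calibration, not its actual algorithm) on a stand-in no-op workload:

```python
import time

def fixed_count_ns(fn, inner=50_000):
    """Fixed-count timing, like a simple C++ harness: the batch size
    is chosen up front and never adjusted."""
    t0 = time.perf_counter_ns()
    for _ in range(inner):
        fn()
    return (time.perf_counter_ns() - t0) / inner

def calibrated_ns(fn, min_sample_ns=1_000_000):
    """Calibrated timing: double the batch size until one sample lasts
    at least ``min_sample_ns`` (here a 1 ms floor), so timer resolution
    is negligible relative to the sample. Roughly what pyperf's
    calibration phase is for, heavily simplified."""
    inner = 1
    while True:
        t0 = time.perf_counter_ns()
        for _ in range(inner):
            fn()
        elapsed = time.perf_counter_ns() - t0
        if elapsed >= min_sample_ns:
            return elapsed / inner
        inner *= 2

workload = lambda: None  # stand-in for a tiny driver-API call
a = fixed_count_ns(workload)
b = calibrated_ns(workload)
print(f"fixed: {a:.1f} ns/call, calibrated: {b:.1f} ns/call")
```

For a steady host-side operation, both estimators converge on essentially the same per-call number; the calibration mostly buys robustness when the cost of one call is unknown in advance.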
Description
closes #1580
Follow-up to #1580
Adds a couple more benchmarks and fixes a couple of issues with the pyperf JSON handling.
Checklist